infectious virus
6754e06e46dfa419d5afe3c9781cecad-AuthorFeedback.pdf
So, the fact that our training data comes solely from infectious virus, which would be highly probable (or "grammatical") sequences under our language model (LM), is a key feature of our approach. Importantly, however, we note that, fundamentally, CSCS is presented in generality here so these methods are not strictly "competitor methods" in the sense that, if one were to work better, it would still be incorporable within the CSCS framework. "ℓ1 rather than Euclidean": We used ℓ1 since it has nicer properties than, e.g., ℓ2 in high-dimensional spaces (Aggarwal et al., ICDT, 2001) but other distance metrics could be empirically quantified. "theoretical detail"/"how the method works": We apologize for sparsity of detail. "choice of beta": We find good robustness of β values reasonably close to 1 (e.g., 0.5-2).
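To make the roles of the ℓ1 distance and β concrete, here is a minimal sketch of how a CSCS-style score might combine the two quantities. The feedback above does not spell out the objective, so the rank-based combination (rank of semantic change plus β times rank of grammaticality) is an assumption, and the function names, embeddings, and probabilities are hypothetical toy values rather than anything taken from the paper.

```python
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences (l1 norm of a - b)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def ranks(values: Sequence[float]) -> List[int]:
    """Rank 1 = smallest value; ties are broken by original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def cscs_scores(
    wildtype_emb: Sequence[float],
    mutant_embs: Sequence[Sequence[float]],
    mutant_probs: Sequence[float],
    beta: float = 1.0,
) -> List[float]:
    """Assumed rank-based combination: higher score = larger semantic
    change and higher language-model probability (grammaticality)."""
    semantic_change = [l1_distance(wildtype_emb, e) for e in mutant_embs]
    sem_rank = ranks(semantic_change)
    gram_rank = ranks(mutant_probs)
    return [s + beta * g for s, g in zip(sem_rank, gram_rank)]


# Hypothetical toy data: three candidate mutants of one wild-type sequence.
wildtype = [0.0, 0.0]
mutants = [[0.1, 0.0], [1.0, 1.0], [0.5, -0.5]]
probs = [0.6, 0.3, 0.05]

scores = cscs_scores(wildtype, mutants, probs, beta=1.0)
best = max(range(len(scores)), key=lambda i: scores[i])
print(scores, best)
```

In this toy case the second mutant ranks first at both β = 0.5 and β = 1: it has the largest semantic change and a moderate probability. Because only ranks enter the score, swapping ℓ1 for another metric matters only insofar as it reorders the candidates.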
Review for NeurIPS paper: Learning Mutational Semantics
Weaknesses: There are a few weaknesses that might be helpful to address (also see comment on Correctness): clarification of the notion of "grammaticality" for this problem; further connections to similar approaches for protein modeling that could be considered; the slightly ad hoc nature of the CSCS objective; and the fact that comparisons made to similarly high-capacity deep unsupervised models in the Appendix did not use viral data. These are explained in further detail below:

Appropriateness of "grammaticality" for the viral immunological escape problem: I appreciate the trend of using massive amounts of unsupervised data to circumvent the difficulty of obtaining fitness measurements for biological sequences, which this work also advances. However, there is some degree of implicit supervision involved here, in that the amino acid sequences used (described in Lines 199-200) are explicitly from infectious viruses (rather than somehow being neutral/benign). It's also not clear that casting this observation as merely an issue of "grammaticality" makes sense: if a "grammatically correct" sequence is one that belongs to an infectious virus, what's the difference between grammaticality and semantics (which are also supposed to capture what makes a sequence infectious)? Perhaps one could claim that grammaticality in this context has to do with whether the protein is "valid", in the sense that it folds or is stable, but this is not explained, and does not absolve the first point that all the sequence data come from infectious viruses (rather than, for example, all valid protein variants that fold/are stable, which would allow for a much clearer distinction between grammaticality and semantics).